Autonomic method and apparatus for counting branch instructions to generate branch statistics meant to improve branch predictions

ABSTRACT

A method, apparatus, and computer instructions for autonomically counting selected branch instructions executed in a processor to improve branch predictions. Counters are provided to count branch instructions that are executed in a processor to collect branch statistics. A set of branch statistics fields is allocated to associate with a branch instruction. When a program is executed, the stored statistics allow the program to look at the branch statistics in the counter to perform branch prediction. Hence, a user may use branch statistics values from the hardware counter to perform analysis on application code.

CROSS REFERENCE TO RELATED APPLICATIONS

The present invention is related to the following applications entitled “Method and Apparatus for Counting Instruction Execution and Data Accesses”, Ser. No. 10/675,777, filed on Sep. 30, 2003; “Method and Apparatus for Selectively Counting Instructions and Data Accesses”, Ser. No. 10/674,604, filed on Sep. 30, 2003; “Method and Apparatus for Generating Interrupts Upon Execution of Marked Instructions and Upon Access to Marked Memory Locations”, Ser. No. 10/675,831, filed on Sep. 30, 2003; “Method and Apparatus for Counting Data Accesses and Instruction Executions that Exceed a Threshold”, Ser. No. 10/675,778, filed on Sep. 30, 2003; “Method and Apparatus for Counting Execution of Specific Instructions and Accesses to Specific Data Locations”, Ser. No. 10/675,776, filed on Sep. 30, 2003; “Method and Apparatus for Debug Support for Individual Instructions and Memory Locations”, Ser. No. 10/675,751, filed on Sep. 30, 2003; “Method and Apparatus to Autonomically Select Instructions for Selective Counting”, Ser. No. 10/675,721, filed on Sep. 30, 2003; “Method and Apparatus to Autonomically Count Instruction Execution for Applications”, Ser. No. 10/675,642, filed on Sep. 30, 2003; “Method and Apparatus to Autonomically Take an Exception on Specified Instructions”, Ser. No. 10/675,606, filed on Sep. 30, 2003; “Method and Apparatus to Autonomically Profile Applications”, Ser. No. 10/675,783, filed on Sep. 30, 2003; “Method and Apparatus for Counting Instruction and Memory Location Ranges”, Ser. No. 10/675,872, filed on Sep. 30, 2003; “Autonomic Method and Apparatus for Hardware Assist for Patching Code”, Ser. No. 10/757,171, filed on Jan. 14, 2004, and “Autonomic Method and Apparatus for Local Program Code Reorganization Using Branch Count Per Instruction Hardware”, Ser. No. 10/757,156, filed on Sep. 30, 2003. All of the above related applications are assigned to the same assignee, and incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates generally to an improved data processing system and, in particular, to a method and system for improving performance of the processor in a data processing system. Still more particularly, the present invention relates to a method, apparatus, and computer instructions for improving branch predictions by autonomically counting branch instructions executed in a processor.

2. Description of Related Art

In a pipelined processor system, instructions are often prefetched from memory to keep the pipeline busy. However, a branch instruction may cause a pipeline to stall. A branch instruction is an instruction that loads a new value in the program counter. As a result, the processor fetches and executes the instruction at this new address, instead of the instruction at the location that follows the branch instruction in sequential address order. A branch instruction may be conditional or unconditional. A conditional branch instruction causes execution to branch or jump to another location of code if a specified condition is satisfied. If the condition is not satisfied, the next instruction in sequential order is fetched and executed.

Branch instructions often cause the pipeline to stall because the branch condition may depend on the result of a preceding instruction. The decision to branch cannot be made until the execution of that instruction has been completed. Therefore, a technique known as branch prediction is used to predict whether or not a particular branch will be taken. Speculative execution takes advantage of branch prediction by executing instructions before the processor is certain that they are in the correct execution path. Thus, if a branch is taken more than 90 percent of the time, it is predicted to be taken, and the instructions on the taken path are executed speculatively by the processor before the branch outcome is known.

Conventionally, branch prediction may be performed in two ways. One way is known as static branch prediction. This approach is performed by the compiler at compile time, which looks at the OP code word of the instruction to indicate whether this branch should be predicted as taken or not taken. The prediction result is the same every time a given branch instruction is encountered. Another approach to branch prediction is known as dynamic branch prediction, which is performed at run time, by keeping track of the result of the branch decision the last time that instruction was executed and assuming that the decision is likely to be the same this time. The prediction result may be different each time the instruction is encountered.

In order to perform dynamic branch prediction, several techniques have been introduced in the prior art. One of these is a branch prediction buffer, which utilizes a buffer or cache indexed by the lower portion of the address of the branch instruction to indicate whether the branch was recently taken or not. However, this technique requires a special cache that would be accessed during fetching and flushed after the predictions are complete.
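
As a rough illustration of the branch prediction buffer described above, the following Python sketch models a small table indexed by the low-order bits of the branch address. The class name, the 4-bit index width, and the example addresses are assumptions made for illustration only; they are not taken from any particular processor.

```python
class BranchPredictionBuffer:
    """Toy model: one 'recently taken' flag per table slot."""

    def __init__(self, index_bits=4):
        # Assumed table size: 2**index_bits entries.
        self.mask = (1 << index_bits) - 1
        self.recently_taken = [False] * (1 << index_bits)

    def _index(self, branch_address):
        # Only the lower portion of the address selects the slot.
        return branch_address & self.mask

    def predict_taken(self, branch_address):
        return self.recently_taken[self._index(branch_address)]

    def record(self, branch_address, taken):
        self.recently_taken[self._index(branch_address)] = taken


bpb = BranchPredictionBuffer()
bpb.record(0x1004, True)
same_branch = bpb.predict_taken(0x1004)     # slot was just updated
aliased_branch = bpb.predict_taken(0x2004)  # shares the low bits of 0x1004
other_branch = bpb.predict_taken(0x1008)    # different slot, never recorded
```

Because only the low-order bits select a slot, distinct branch instructions can share an entry, so a table of this kind does not by itself keep statistics per branch instruction.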

Another existing technique for performing dynamic branch prediction uses a branch target buffer, which is similar to a cache, except the value in the cache includes the address of the next instruction instead of the contents of the memory location. Also, the instruction itself may be stored instead of the address. This approach is known as branch folding. However, none of the currently existing techniques provides a solution for branch prediction at the instruction level, where detailed branch statistics are collected per branch instruction. In addition, none of the currently existing techniques provides a running history of branch prediction by associating branch statistic fields with branch instructions, so that better branch predictions may be performed by storing branch prediction values associated with each branch instruction in a dedicated memory location.

Therefore, it would be advantageous to have an improved method, apparatus and computer instructions for counting branch instructions to improve branch prediction, so that localized branch prediction may be performed at the instruction level during code execution and branch statistics may be collected later on to optimize performance of the system.

SUMMARY OF THE INVENTION

The present invention provides a method, apparatus, and computer instructions for improving branch predictions by autonomically counting branch instructions executed in a processor. In a preferred embodiment, selected pieces of code are identified for branch statistics, counters are used to count the number of times the identified branches are taken or not taken during program execution, and a set of branch statistics per branch instruction is derived based on the count. The branch count associated with the branch instruction is incremented when a branch is taken and decremented when a branch is not taken. Hence, the branch prediction field is updated. A running history of branch statistics is collected during program execution, which may help to improve branch predictions of a program. In addition, an application may switch a hardware counter's mode of operation at run time to take a different set of branches for a given conditional branch instruction.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is an exemplary block diagram of a data processing system in which the present invention may be implemented;

FIG. 2 is an exemplary block diagram of a processor system for processing information in accordance with a preferred embodiment of the present invention;

FIG. 3 is an exemplary diagram illustrating one mechanism of associating branch statistics with a branch instruction in accordance with a preferred embodiment of the present invention;

FIG. 4 is an exemplary diagram illustrating example branch statistics in accordance with a preferred embodiment of the present invention;

FIG. 5 is a flowchart outlining an exemplary process for counting branch instructions to improve branch predictions in accordance with a preferred embodiment of the present invention; and

FIG. 6 is a flowchart outlining an exemplary process for switching modes of operation by application software in accordance with a preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention improves branch predictions by autonomically counting a selected set of branch instructions executed in a processor. In a preferred embodiment, counters are used to count the number of times branches are taken or not taken during program execution and a set of branch statistics per branch instruction is derived based on the count. The branch count associated with the branch instruction is incremented when a branch is taken and decremented when a branch is not taken. This information is used as data for predicting whether a branch will be taken, and the result of this prediction is located in a branch prediction field. Hence, the branch prediction field is updated according to data of the hardware counters. A running history of branch statistics is collected during program execution, which may help to improve branch predictions of a program. In addition, an application may switch a hardware counter's mode of operation at run time to take a different set of branches for a given conditional branch instruction.

The present invention is preferably implemented on a computer system, such as a client or server in a client-server network environment. With reference now to FIG. 1, an exemplary block diagram of a data processing system is shown in which the present invention may be implemented. Client 100 is an example of a computer, in which code or instructions implementing the processes of the present invention may be located. Client 100 employs a peripheral component interconnect (PCI) local bus architecture. Although the depicted example employs a PCI bus, other bus architectures such as Accelerated Graphics Port (AGP) and Industry Standard Architecture (ISA) may be used. Processor 102 and main memory 104 are connected to PCI local bus 106 through PCI bridge 108. PCI bridge 108 also may include an integrated memory controller and cache memory for processor 102. Additional connections to PCI local bus 106 may be made through direct component interconnection or through add-in boards. In the depicted example, local area network (LAN) adapter 110, small computer system interface (SCSI) host bus adapter 112, and expansion bus interface 114 are connected to PCI local bus 106 by direct component connection. In contrast, audio adapter 116, graphics adapter 118, and audio/video adapter 119 are connected to PCI local bus 106 by add-in boards inserted into expansion slots. Expansion bus interface 114 provides a connection for a keyboard and mouse adapter 120, modem 122, and additional memory 124. SCSI host bus adapter 112 provides a connection for hard disk drive 126, tape drive 128, and CD-ROM drive 130. Typical PCI local bus implementations will support three or four PCI expansion slots or add-in connectors.

An operating system runs on processor 102 and is used to coordinate and provide control of various components within data processing system 100 in FIG. 1. The operating system may be a commercially available operating system such as Windows XP, which is available from Microsoft Corporation. An object oriented programming system such as Java may run in conjunction with the operating system and provides calls to the operating system from Java programs or applications executing on client 100. “Java” is a trademark of Sun Microsystems, Inc. Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 126, and may be loaded into main memory 104 for execution by processor 102.

Those of ordinary skill in the art will appreciate that the hardware in FIG. 1 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash read-only memory (ROM), equivalent nonvolatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIG. 1. Also, the processes of the present invention may be applied to a multiprocessor data processing system.

For example, client 100, if optionally configured as a network computer, may not include SCSI host bus adapter 112, hard disk drive 126, tape drive 128, and CD-ROM 130. In that case, the computer, to be properly called a client computer, includes some type of network communication interface, such as LAN adapter 110, modem 122, or the like. As another example, client 100 may be a stand-alone system configured to be bootable without relying on some type of network communication interface, whether or not client 100 comprises some type of network communication interface. As a further example, client 100 may be a personal digital assistant (PDA), which is configured with ROM and/or flash ROM to provide non-volatile memory for storing operating system files and/or user-generated data. The depicted example in FIG. 1 and above-described examples are not meant to imply architectural limitations.

The processes of the present invention are performed by processor 102 using computer implemented instructions, which may be located in a memory such as, for example, main memory 104, memory 124, or in one or more peripheral devices 126-130.

Turning next to FIG. 2, an exemplary block diagram of a processor system for processing information is depicted in accordance with a preferred embodiment of the present invention. Processor 210 may be implemented as processor 102 in FIG. 1.

In a preferred embodiment, processor 210 is a single integrated circuit superscalar microprocessor. Accordingly, as discussed further hereinbelow, processor 210 includes various units, registers, buffers, memories, and other sections, all of which are formed by integrated circuitry. Also, in the preferred embodiment, processor 210 operates according to reduced instruction set computer (“RISC”) techniques. As shown in FIG. 2, system bus 211 is connected to a bus interface unit (“BIU”) 212 of processor 210. BIU 212 controls the transfer of information between processor 210 and system bus 211.

BIU 212 is connected to an instruction cache 214 and to data cache 216 of processor 210. Instruction cache 214 outputs instructions to sequencer unit 218. In response to such instructions from instruction cache 214, sequencer unit 218 selectively outputs instructions to other execution circuitry of processor 210.

In addition to sequencer unit 218, in the preferred embodiment, the execution circuitry of processor 210 includes multiple execution units, namely a branch unit 220, a fixed-point unit A (“FXUA”) 222, a fixed-point unit B (“FXUB”) 224, a complex fixed-point unit (“CFXU”) 226, a load/store unit (“LSU”) 228, and a floating-point unit (“FPU”) 230. FXUA 222, FXUB 224, CFXU 226, and LSU 228 input their source operand information from general-purpose architectural registers (“GPRs”) 232 and fixed-point rename buffers 234. Moreover, FXUA 222 and FXUB 224 input a “carry bit” from a carry bit (“CA”) register 239. FXUA 222, FXUB 224, CFXU 226, and LSU 228 output results (destination operand information) of their operations for storage at selected entries in fixed-point rename buffers 234. Also, CFXU 226 inputs and outputs source operand information and destination operand information to and from special-purpose register processing unit (“SPR unit”) 237.

FPU 230 inputs its source operand information from floating-point architectural registers (“FPRs”) 236 and floating-point rename buffers 238. FPU 230 outputs results (destination operand information) of its operation for storage at selected entries in floating-point rename buffers 238.

In response to a Load instruction, LSU 228 inputs information from data cache 216 and copies such information to selected ones of rename buffers 234 and 238. If such information is not stored in data cache 216, then data cache 216 inputs (through BIU 212 and system bus 211) such information from a system memory 260 connected to system bus 211. Moreover, data cache 216 is able to output (through BIU 212 and system bus 211) information from data cache 216 to system memory 260 connected to system bus 211. In response to a Store instruction, LSU 228 inputs information from a selected one of GPRs 232 and FPRs 236 and copies such information to data cache 216.

Sequencer unit 218 inputs and outputs information to and from GPRs 232 and FPRs 236. From sequencer unit 218, branch unit 220 inputs instructions and signals indicating a present state of processor 210. In response to such instructions and signals, branch unit 220 outputs (to sequencer unit 218) signals indicating suitable memory addresses storing a sequence of instructions for execution by processor 210. In response to such signals from branch unit 220, sequencer unit 218 inputs the indicated sequence of instructions from instruction cache 214. If one or more of the sequence of instructions is not stored in instruction cache 214, then instruction cache 214 inputs (through BIU 212 and system bus 211) such instructions from system memory 260 connected to system bus 211.

In response to the instructions input from instruction cache 214, sequencer unit 218 selectively dispatches the instructions to selected ones of execution units 220, 222, 224, 226, 228, and 230. Each execution unit executes one or more instructions of a particular class of instructions. For example, FXUA 222 and FXUB 224 execute a first class of fixed-point mathematical operations on source operands, such as addition, subtraction, ANDing, ORing and XORing. CFXU 226 executes a second class of fixed-point operations on source operands, such as fixed-point multiplication and division. FPU 230 executes floating-point operations on source operands, such as floating-point multiplication and division.

As information is stored at a selected one of rename buffers 234, such information is associated with a storage location (e.g. one of GPRs 232 or carry bit (CA) register 239) as specified by the instruction for which the selected rename buffer is allocated. Information stored at a selected one of rename buffers 234 is copied to its associated one of GPRs 232 (or CA register 239) in response to signals from sequencer unit 218. Sequencer unit 218 directs such copying of information stored at a selected one of rename buffers 234 in response to “completing” the instruction that generated the information. Such copying is called “writeback.”

As information is stored at a selected one of rename buffers 238, such information is associated with one of FPRs 236. Information stored at a selected one of rename buffers 238 is copied to its associated one of FPRs 236 in response to signals from sequencer unit 218. Sequencer unit 218 directs such copying of information stored at a selected one of rename buffers 238 in response to “completing” the instruction that generated the information.

Processor 210 achieves high performance by processing multiple instructions simultaneously at various ones of execution units 220, 222, 224, 226, 228, and 230. Accordingly, each instruction is processed as a sequence of stages, each being executable in parallel with stages of other instructions. Such a technique is called “pipelining.” In a significant aspect of the illustrative embodiment, an instruction is normally processed as six stages, namely fetch, decode, dispatch, execute, completion, and writeback.

In the fetch stage, sequencer unit 218 selectively inputs (from instruction cache 214) one or more instructions from one or more memory addresses storing the sequence of instructions discussed further hereinabove in connection with branch unit 220 and sequencer unit 218.

In the decode stage, sequencer unit 218 decodes up to four fetched instructions.

In the dispatch stage, sequencer unit 218 selectively dispatches up to four decoded instructions to selected (in response to the decoding in the decode stage) ones of execution units 220, 222, 224, 226, 228, and 230 after reserving rename buffer entries for the dispatched instructions' results (destination operand information). In the dispatch stage, operand information is supplied to the selected execution units for dispatched instructions. Processor 210 dispatches instructions in order of their programmed sequence.

In the execute stage, execution units execute their dispatched instructions and output results (destination operand information) of their operations for storage at selected entries in rename buffers 234 and rename buffers 238 as discussed further hereinabove. In this manner, processor 210 is able to execute instructions out-of-order relative to their programmed sequence.

In the completion stage, sequencer unit 218 indicates an instruction is “complete.” Processor 210 “completes” instructions in order of their programmed sequence.

In the writeback stage, sequencer unit 218 directs the copying of information from rename buffers 234 and 238 to GPRs 232 and FPRs 236, respectively. Sequencer unit 218 directs such copying of information stored at a selected rename buffer. Likewise, in the writeback stage of a particular instruction, processor 210 updates its architectural states in response to the particular instruction. Processor 210 processes the respective “writeback” stages of instructions in order of their programmed sequence. Processor 210 advantageously merges an instruction's completion stage and writeback stage in specified situations.

In the illustrative embodiment, each instruction requires one machine cycle to complete each of the stages of instruction processing. Nevertheless, some instructions (e.g., complex fixed-point instructions executed by CFXU 226) may require more than one cycle. Accordingly, a variable delay may occur between a particular instruction's execution and completion stages in response to the variation in time required for completion of preceding instructions.

Completion buffer 248 is provided within sequencer unit 218 to track the completion of the multiple instructions which are being executed within the execution units. Upon an indication that an instruction or a group of instructions has been completed successfully, in an application-specified sequential order, completion buffer 248 may be utilized to initiate the transfer of the results of those completed instructions to the associated general-purpose registers.

In addition, processor 210 also includes performance monitor unit 240, which is connected to instruction cache 214 as well as other units in processor 210. Operation of processor 210 can be monitored utilizing performance monitor unit 240, which in this illustrative embodiment is a software-accessible mechanism capable of providing detailed information descriptive of the utilization of instruction execution resources and storage control. Although not illustrated in FIG. 2, performance monitor unit 240 is coupled to each functional unit of processor 210 to permit the monitoring of all aspects of the operation of processor 210, including, for example, reconstructing the relationship between events, identifying false triggering, identifying performance bottlenecks, monitoring pipeline stalls, monitoring idle processor cycles, determining dispatch efficiency, determining branch efficiency, determining the performance penalty of misaligned data accesses, identifying the frequency of execution of serialization instructions, identifying inhibited interrupts, and determining performance efficiency. The events of interest also may include, for example, time for instruction decode, execution of instructions, branch events, cache misses, and cache hits.

Performance monitor unit 240 includes an implementation-dependent number (e.g., 2-8) of counters 241-242, labeled PMC1 and PMC2, which are utilized to count occurrences of selected events. Performance monitor unit 240 further includes at least one monitor mode control register (MMCR). In this example, two control registers, MMCRs 243 and 244, are present that specify the function of counters 241-242. Counters 241-242 and MMCRs 243-244 are preferably implemented as SPRs that are accessible for read or write via MFSPR (move from SPR) and MTSPR (move to SPR) instructions executable by CFXU 226. However, in one alternative embodiment, counters 241-242 and MMCRs 243-244 may be implemented simply as addresses in I/O space. In another alternative embodiment, the control registers and counters may be accessed indirectly via an index register. This embodiment is implemented in the IA-64 architecture in processors from Intel Corporation. Counters 241-242 may also be used to collect branch statistics per instruction when a program is executed.

The present invention provides a method, apparatus, and computer instructions for autonomically counting branch instructions executed in a processor to improve branch prediction. In one embodiment, the mechanism of the present invention provides counters to count the number of times a branch is taken per branch instruction, in order to derive other branch statistics per branch instruction in the application code. A set of statistics is allocated to track branch statistics, such as the number of times a branch is taken, whether a branch was taken the last time the branch instruction was executed, and the branch prediction associated with the branch instruction.

Turning next to FIG. 3, an exemplary diagram illustrating one mechanism of associating branch statistics with a branch instruction is depicted in accordance with a preferred embodiment of the present invention. A number of branch statistics 302-308 are allocated by the loader in a performance instrumentation shadow cache 310. Each branch instruction is associated with a separate set of branch statistics. The performance instrumentation shadow cache 310 is a separate area of storage, which may be any storage device, such as, for example, a system memory, a flash memory, a cache, or a disk.

When the application code is compiled, meta data 312 is generated by a compiler 322 in an environment running on a client, such as environment 320. The meta data maps each branch instruction to corresponding branch statistics stored in the performance instrumentation shadow cache. For example, meta data 312 maps branch instruction 314 to branch statistics 304 allocated in performance instrumentation shadow cache 310. Branch statistics are discussed in further detail with reference to FIG. 4. When the processor 316 receives an instruction from cache 318, the processor 316 checks to see whether meta data is associated with the instruction, in this case, branch instruction 314.

When the program is loaded, meta data 312 is prepared by the loader 324 so the meta data will be available to incorporate into performance instrumentation shadow cache 310 when branch instruction 314 is loaded into cache 318. Prior to executing the program, the link editor or linker/loader 324 allocates a work area, such as branch statistics 304, and notifies the processor of environment setup 326, which tells the processor to operate in a special mode. Environment setup 326 enables branch statistics 304 to be queried by the application at run time through the use of libraries 328. Libraries 328 are software modules that provide linkage between application programs and the allocated work area where the branch statistics are stored, such as branch statistics 304.

When code is executed in an application, the program may pause execution and examine the code for a branch instruction. The processor sees meta data 312 associated with branch instruction 314 and knows branch statistics 304 are stored in the performance instrumentation shadow cache 310. When the branch is executed, the branch unit, such as branch unit 220 in FIG. 2, notifies performance instrumentation shadow cache 310 of whether the branch is taken or not taken, in the form of a flag and the address of the branch instruction. If the branch is taken, performance instrumentation shadow cache 310 then notifies the hardware counter, such as PMC1 241 and PMC2 242 in FIG. 2, to increment the branch count of branch statistics 304 and update the branch field of branch statistics 304 to “taken”. The branch count and the branch field are part of branch statistics 304 and are described further in FIG. 4. If a branch is not taken, performance instrumentation shadow cache 310 notifies the hardware counter to decrement the branch count of branch statistics 304 and update the branch field of branch statistics 304 to “not taken”. The next time the same code is executed, the program pauses execution and examines the code again for branch instruction 314. Branch statistics 304 in performance instrumentation shadow cache 310 are then queried by the program to predict whether a branch is to be taken or not taken, preferably by comparing the branch count of branch statistics 304 to a threshold. Based on the result of the prediction, the program updates the branch prediction field of branch statistics 304 and prefetches branch instruction 314 if the prediction indicates the branch is to be taken. The branch prediction field is part of branch statistics 304 and is described further in FIG. 4. This cycle continues to collect a history of branch statistics associated with each branch instruction.
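
The count-and-predict cycle described above can be sketched in software as follows. This is only an illustrative model under stated assumptions: a Python dictionary stands in for performance instrumentation shadow cache 310, the function names and the branch address 0x4000 are hypothetical, and the threshold of 0 is an assumed value, since the description does not specify one.

```python
# Stand-in for shadow cache 310: branch address -> statistics record.
shadow_cache = {}


def allocate(branch_address):
    # Models the loader allocating a work area for one branch instruction.
    shadow_cache[branch_address] = {"count": 0, "last": None, "prediction": None}


def record_outcome(branch_address, taken):
    # Models the notification carrying a taken/not-taken flag and the address.
    stats = shadow_cache[branch_address]
    stats["count"] += 1 if taken else -1           # increment or decrement
    stats["last"] = "taken" if taken else "not taken"


def predict(branch_address, threshold=0):
    # Assumed rule: predict taken when the count exceeds the threshold.
    stats = shadow_cache[branch_address]
    stats["prediction"] = "taken" if stats["count"] > threshold else "not taken"
    return stats["prediction"]


allocate(0x4000)
for outcome in (True, True, False, True):
    record_outcome(0x4000, outcome)
prediction = predict(0x4000)
```

With three taken outcomes and one not-taken outcome, the count ends at 2, so this sketch predicts the branch as taken under the assumed threshold.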

Since the branch statistics collected are stored in the performance instrumentation shadow cache area 310, the program may refer to branch statistics 302-308 at any time to determine whether a branch instruction will be executed. In addition, branch statistics may be used to analyze the performance of the application code for future execution.

Furthermore, in another preferred embodiment, when applying branch statistics to a conditional branch, application software may instruct the hardware to change its mode of operation. For example, one mode may be “take the branch” and another mode may be “do not take the branch”. The application software may have two separate sets of branch predictions, which may be used at run time to instruct the hardware to operate in a particular mode. This capability allows the application software to control the hardware counters, such as PMC1 241 and PMC2 242 in FIG. 2, by determining which hardware counter to use in one mode versus another mode. This implementation results in separate statistics and predictions for each branch, one set of predictions corresponding to a first mode, the other set corresponding to a second mode.

For example, the two modes might have different prediction results for a given branch. Certain events may occur that indicate one or the other of the predictions is likely to be correct (based on other information than only the statistics). In such a case, one of the modes, with the proper prediction for the branch, will be entered, determining that the branch is taken. In this way, various modes of operation, each with specific predictions that are predetermined for a given number of branches, can be entered when circumstances warrant.

The application software may switch modes of operation at run time by using an application programming interface (API). The API retrieves branch statistics information from the hardware counters using techniques described above. Application software may use this information to determine that a desired result will occur by calling a different subroutine to take a different path. For example, an application may have two pieces of code, one piece that works well with the branch taken and another one that works well with the branch not taken. By retrieving the information about the branch statistics, the application may update its code at run time to call a different subroutine based on the above information. Thus, the API allows application developers to develop applications by sharing knowledge from the hardware counters.
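
One way an application might use such an API is sketched below. The function `read_branch_count` is a hypothetical stand-in for the API call, the statistics are supplied as a plain dictionary rather than read from hardware, and the threshold of 0 and the two subroutine names are assumptions for illustration.

```python
def read_branch_count(statistics, branch_address):
    """Hypothetical API call: return the branch count kept for one branch."""
    return statistics[branch_address]["count"]


def path_for_taken():
    return "variant tuned for branch taken"


def path_for_not_taken():
    return "variant tuned for branch not taken"


def select_subroutine(statistics, branch_address, threshold=0):
    # Choose the piece of code that suits the observed branch behavior.
    count = read_branch_count(statistics, branch_address)
    return path_for_taken if count > threshold else path_for_not_taken


# Illustrative statistics only; not measured data.
statistics = {0x4000: {"count": 6}, 0x4010: {"count": -3}}
first = select_subroutine(statistics, 0x4000)()
second = select_subroutine(statistics, 0x4010)()
```

Returning the subroutine itself, rather than a flag, keeps the selection and the call separate, so the application can cache the choice and repeat the query only when the statistics change.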

In another example, an application may have two hardware counters, such as PMC1 241 and PMC2 242 in FIG. 2, that collect two different sets of branch statistics. When the code is executed, a cache miss occurs and the application detects an internal state change that causes the application to tell the hardware which counter to switch to in order to choose a predetermined set of branches, based on the comparison of the counters with a threshold, for example. By using a different set of branch statistics provided by the counter, the application may call a different subroutine to execute a different set of branch instructions. The criteria of branch prediction may vary from counter to counter. One counter may predict a branch is to be taken with a branch count of 5 and another may predict a branch is to be taken with a branch count of 10. Branch predictions may differ from counter to counter based on the branch statistic.
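The per-counter criteria in the preceding example can be illustrated with a minimal software sketch. This is not the hardware mechanism itself: the class name `CounterProfile`, the function `predictions`, and the use of a greater-than-or-equal comparison are assumptions made only for illustration. The thresholds 5 and 10 are the example values given above.

```python
from dataclasses import dataclass


@dataclass
class CounterProfile:
    """Hypothetical stand-in for one hardware counter's prediction criterion."""
    name: str
    taken_threshold: int  # assumed: predict "taken" once the count reaches this value

    def predicts_taken(self, branch_count: int) -> bool:
        # Assumption: the comparison is "count >= threshold"; the text only
        # says that a count is compared with a threshold.
        return branch_count >= self.taken_threshold


# Example values from the description: one counter uses 5, the other uses 10.
pmc1 = CounterProfile("PMC1", taken_threshold=5)
pmc2 = CounterProfile("PMC2", taken_threshold=10)


def predictions(branch_count: int) -> dict:
    """Return each counter's prediction for the same branch count."""
    return {c.name: c.predicts_taken(branch_count) for c in (pmc1, pmc2)}
```

Under these assumptions, a branch count of 7 would be predicted taken under the first counter's criterion and not taken under the second, which is the kind of divergence the two modes rely on.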

Turning next to FIG. 4, an exemplary diagram illustrating example branch statistics is depicted in accordance with a preferred embodiment of the present invention. In this illustrative example implementation, there are three branch statistic fields associated with each branch statistic: branch field 402, branch prediction field 404, and branch count field 406. These branch statistics fields are part of branch statistics, such as branch statistics 302-308, stored in a separate area of storage, such as performance instrumentation shadow cache 310 as described in FIG. 3. Branch field 402 indicates whether a branch is taken or not the last time the branch instruction, such as branch instruction 314 in FIG. 3, is executed. Branch prediction field 404 indicates the branch prediction made based on the branch count. There may be three values associated with the branch prediction field. A value of “00” indicates that no previous data is collected for branch instruction 314. A value of “01” indicates a branch is predicted to be taken for branch instruction 314, and a value of “02” indicates a branch is predicted to be not taken for branch instruction 314. Branch prediction is normally performed before the branch is executed. Branch count field 406 indicates the number of times a branch is taken when the code for that branch instruction is executed. Hardware counters increment or decrement this field based on whether a branch is taken or not when the code instruction is executed.
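As a hedged illustration only, the three fields of FIG. 4 could be modeled in software as follows. The names `Prediction`, `BranchStatistics`, and `decode_prediction` are hypothetical, and the embodiment keeps these fields in hardware-managed storage rather than in a program object. The field values “00”, “01”, and “02” are copied from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Prediction(Enum):
    """Values of branch prediction field 404, as listed in the description."""
    NO_DATA = "00"    # no previous data collected for the branch instruction
    TAKEN = "01"      # branch is predicted to be taken
    NOT_TAKEN = "02"  # branch is predicted to be not taken


@dataclass
class BranchStatistics:
    """Hypothetical software view of one set of branch statistics fields."""
    branch_taken_last: bool = False               # branch field 402
    prediction: Prediction = Prediction.NO_DATA   # branch prediction field 404
    branch_count: int = 0                         # branch count field 406


def decode_prediction(raw: str) -> Prediction:
    """Map a raw field value such as "01" to its meaning."""
    return Prediction(raw)
```

A freshly allocated `BranchStatistics()` starts with no prediction data and a branch count of zero, matching a branch instruction that has not yet been executed.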

With reference to FIG. 5, a flowchart outlining an exemplary process for counting branch instructions to improve branch predictions is depicted in accordance with a preferred embodiment of the present invention. As depicted in FIG. 5, in a preferred embodiment, the process begins when the CPU processes instructions from an execution sequence when a program is executed (step 502). The prefetch algorithm of the CPU looks ahead and sees a branch instruction (step 504). The CPU then looks up the branch count associated with the branch instruction (step 506) and makes a branch prediction based on the branch count (step 508). For example, the branch count can be compared to a predetermined threshold to determine the prediction. If it is predicted that the branch is to be taken, the CPU prefetches the branch to instructions (step 510). When the CPU executes the branch instruction (step 512), a determination is made as to whether the branch predicted to be taken was actually taken or not taken (step 514). If the branch is actually taken, the branch unit notifies the cache unit of the address of the branch instruction and sets a flag indicating the branch is taken. The cache unit then increments the branch count associated with the instruction (step 516) and the branch field is updated to reflect a branch is taken last time the branch instruction is executed (step 518) and the process terminates thereafter.

If the branch is actually not taken from step 514, the branch unit notifies the cache unit of the address of the branch instruction and sets a flag indicating the branch is not taken. The cache unit then decrements the branch count associated with the instruction (step 520) and the branch field is updated to reflect a branch is not taken last time the branch instruction is executed (step 522) and the process terminates thereafter.
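The flow of FIG. 5 can be approximated in software by the following minimal sketch. It is a simulation under stated assumptions, not the hardware implementation: the function names, the dictionary used in place of the shadow cache, the default threshold of 0, and the strict greater-than comparison are all hypothetical choices. Only the increment on taken, the decrement on not taken, and the update of the branch field come from the description.

```python
def predict_taken(branch_count: int, threshold: int = 0) -> bool:
    """Steps 506-508: compare the branch count with a threshold (assumed '>')."""
    return branch_count > threshold


def execute_branch(stats: dict, address: int, actually_taken: bool,
                   threshold: int = 0) -> bool:
    """Simulate one pass through FIG. 5 for the branch at `address`.

    `stats` is a hypothetical stand-in for the per-branch statistics storage.
    Returns the prediction that was made before the branch executed.
    """
    entry = stats.setdefault(address, {"count": 0, "last_taken": None})

    # Steps 506-510: look up the count, predict, and (if predicted taken)
    # the CPU would prefetch the target instructions.
    predicted = predict_taken(entry["count"], threshold)

    # Steps 512-522: after execution, update the count and the branch field.
    if actually_taken:
        entry["count"] += 1          # step 516
        entry["last_taken"] = True   # step 518
    else:
        entry["count"] -= 1          # step 520
        entry["last_taken"] = False  # step 522
    return predicted
```

For instance, calling `execute_branch` on an empty `stats` dictionary for the outcomes taken, taken, not taken, taken leaves the count for that address at 2, since three increments and one decrement are applied.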

Turning next to FIG. 6, a flowchart outlining an exemplary process for switching modes of operation of a conditional branch by application software is depicted in accordance with a preferred embodiment of the present invention. As depicted in FIG. 6, in a preferred embodiment, the process begins when instructions of a program are executed (step 602). The application software then queries branch prediction statistics collected by the hardware counters (step 604). Based on the values of the branch statistics (as determined from the branch counts collected), the application software decides which subroutine to call (step 606) and executes the subroutine (step 608). While the subroutine is executing, application software detects an internal state change that, according to previous information about the program, indicates that a different set of branches is most likely to be taken (step 610). An example internal state change may be a cache miss encountered during code execution, wherein the cache miss indicates that certain branches will later be taken. The application software then tells the hardware to switch counters on the next branch instruction (step 612) to collect a different set of branch statistics in order to make branch predictions that predict a different set of branches.
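The steps of FIG. 6 might look roughly like the following from the application's side. This is a sketch only: `SimulatedBranchHardware`, its methods, and the subroutine labels are hypothetical, since the description does not define the API's function names or signatures. The counter names PMC1 and PMC2 follow FIG. 2.

```python
class SimulatedBranchHardware:
    """Hypothetical software model of two switchable hardware counters."""

    def __init__(self):
        # One independent set of branch counts per counter.
        self.counters = {"PMC1": {}, "PMC2": {}}
        self.active = "PMC1"

    def record(self, address: int, taken: bool) -> None:
        """Increment or decrement the active counter's count for a branch."""
        counts = self.counters[self.active]
        counts[address] = counts.get(address, 0) + (1 if taken else -1)

    def query(self, address: int) -> int:
        """Step 604: read the branch count from the active counter."""
        return self.counters[self.active].get(address, 0)

    def switch_counter(self, name: str) -> None:
        """Step 612: select which counter collects statistics from now on."""
        if name not in self.counters:
            raise KeyError(name)
        self.active = name


def choose_subroutine(hw: SimulatedBranchHardware, address: int,
                      threshold: int = 0) -> str:
    """Step 606: pick a code path from the active counter's statistics."""
    if hw.query(address) > threshold:  # assumed comparison
        return "path_for_taken"
    return "path_for_not_taken"
```

In this model, an application that detects an internal state change such as a cache miss (step 610) would call `switch_counter("PMC2")`, after which `choose_subroutine` consults the second set of statistics while the first set is left intact.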

Thus, the present invention provides an improved method, apparatus, and computer instructions for branch predictions using hardware counters to autonomically collect more detailed branch statistics associated with a branch instruction of a program. The mechanism of the present invention allows branch predictions to be made at run time based on the running history of branch statistics stored in a performance instrumentation shadow cache, which is accessible by the application. This mechanism allows users to analyze application code autonomically during and after the code is executed.

Furthermore, using the innovative features of the present invention, application software may switch modes of operation of a conditional branch at run time by toggling hardware counters. A different counter may have a different set of branch statistics that results in different branch predictions, which in turn causes a different set of branches to be chosen for a different mode of operation.

It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media, such as a floppy disk, a hard disk drive, a RAM, CD-ROMs, DVD-ROMs, and transmission-type media, such as digital and analog communications links, wired or wireless communications links using transmission forms, such as, for example, radio frequency and light wave transmissions. The computer readable media may take the form of coded formats that are decoded for actual use in a particular data processing system.

The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

1. A method of performing branch prediction in a computer program, comprising the steps of: identifying a plurality of branch instructions for application code being compiled; associating a plurality of hardware counters with the plurality of branch instructions; using the plurality of hardware counters to autonomically count all of the plurality of branch instructions that are executed in parallel to generate a plurality of branch statistics; predicting branches to be taken using the plurality of branch statistics to form branch predictions; and prefetching the plurality of branch instructions using the plurality of branch predictions.
 2. The method of claim 1, wherein the plurality of branch instructions are associated with the plurality of branch statistics, and wherein the plurality of branch statistics are stored in a plurality of branch statistic fields.
 3. The method of claim 2, wherein the plurality of branch statistic fields store a plurality of data on an associated branch instruction, wherein a first datum of the plurality of data is accessed for branch prediction when the program is in a first mode, and wherein a second datum of the plurality of data is accessed for branch prediction when the program is in a second mode.
 4. The method of claim 2, wherein the plurality of branch statistic fields include a branch count per instruction field that represents the number of times a branch is taken for that branch instruction.
 5. The method of claim 1, wherein upon occurrence of a predetermined event, the computer program switches branch prediction operating modes on a conditional branch instruction.
 6. The method of claim 1, wherein the plurality of branch statistics is stored in a performance instrumentation shadow cache.
 7. The method of claim 1, wherein branches per instruction are counted during execution of the computer program.
 8. A branch prediction apparatus, comprising: a compiler that identifies a plurality of branch instructions for application code being compiled; a plurality of hardware counters associated with the plurality of branch instructions of the application code; a plurality of branch statistic fields for storing a plurality of branch statistics associated with the plurality of branch instructions; wherein when a branch instruction in the plurality of branch instructions is executed in the application code, a hardware counter of the plurality of hardware counters autonomically counts all of the plurality of branch instructions that are executed and updates in parallel branch statistics in the plurality of branch statistic fields; a processor that predicts branches to be taken using the plurality of branch statistics to form branch predictions; and the processor prefetches the plurality of branch instructions using branch predictions.
 9. The apparatus of claim 8, wherein the plurality of branch statistics is used to make branch predictions in the application code.
 10. The apparatus of claim 8, further comprising a plurality of operating modes of the application code, wherein for a first branch instruction, an associated branch statistics field stores first branch statistics for a first mode of the plurality of operating modes and second branch statistics for a second mode of the plurality of operating modes.
 11. The apparatus of claim 8, wherein the plurality of branch statistic fields include a branch count per instruction field that represents the number of times a branch is taken for that branch instruction.
 12. The apparatus of claim 8, wherein upon occurrence of a predetermined event, the program switches branch prediction operating modes on a conditional branch instruction.
 13. The apparatus of claim 8, wherein the plurality of branch statistics is stored in a performance instrumentation shadow cache.
 14. The apparatus of claim 8, wherein branches per instruction are counted during execution of the program.
 15. A computer program product in a recordable-type computer readable medium, comprising: instructions for identifying a plurality of branch instructions for application code being compiled; instructions for associating a plurality of hardware counters with the plurality of branch instructions; instructions for autonomically counting all of the plurality of branch instructions that are executed in parallel using the plurality of hardware counters to thereby generate a plurality of branch statistics; instructions for predicting branches to be taken using the plurality of branch statistics to form branch predictions; and instructions for executing the application code using the branch predictions.
 16. The computer program product of claim 15, wherein the plurality of branch instructions are associated with the plurality of branch statistics, and wherein the plurality of branch statistics are stored in a plurality of branch statistic fields.
 17. The computer program product of claim 16, wherein the plurality of branch statistic fields store a plurality of data on an associated branch instruction, wherein a first datum of the plurality of data is accessed for branch prediction when the program is in a first mode, and wherein a second datum of the plurality of data is accessed for branch prediction when the program is in a second mode.
 18. The computer program product of claim 16, wherein the plurality of branch statistic fields include a branch count per instruction field that represents the number of times a branch is taken for that branch instruction.
 19. The computer program product of claim 15, wherein upon occurrence of a predetermined event, the computer program switches branch prediction operating modes on a conditional branch instruction.
 20. The computer program product of claim 15, wherein the plurality of branch statistics is stored in a performance instrumentation shadow cache.
 21. The computer program product of claim 15, wherein branches per instruction are counted during execution of the computer program.